Based on notes by Professors A.Muñoz, J.Portela & J.Pizarroso
Published
January 2025
Introduction
Moving Along the Modeling Process
In our previous session we showed how to preprocess a data set for a classification problem, so we are now ready to move on to the last three steps of our plan to solve that problem:
Choose a model
Fit the parameters of the model
Assess the model quality
Logistic Regression
In this session we will meet our first Machine Learning model. Logistic Regression is a classical model for binary classification problems and can often be considered a baseline or reference model. It is fast to train and fast to predict, and it is arguably the best, or close to the best, model in terms of interpretability. Therefore any fancier, more sophisticated model is usually compared against logistic regression to decide whether its costs in computation and interpretability are reasonable.
Probabilistic vs Hard Approach
Recall that a classification method uses the probabilistic (soft) approach if it predicts the conditional probability that the output \(y\) equals 1, given an input value \(\mathbf x\): \[P(y = 1 | \mathbf x) = f(\mathbf x)\] Logistic regression is an example of a probabilistic classification method. A classification method that directly predicts the class uses the hard approach. Keep in mind that every probabilistic classifier can be turned into a hard classifier by setting a probability threshold.
Probabilistic vs Hard Approach in Logistic Regression
The probabilistic approach in Logistic Regression uses the following equation to model conditional probabilities: \[P(y = 1 | \bar x) = \dfrac{e^{b_0 + \bar b_1 \bar x}}{1 + e^{b_0 + \bar b_1 \bar x}}\] Where, in general, \(\bar x\) is a multidimensional input vector.
The One Dimensional Case.
Logistic Curves
We will first look at the one dimensional case: a classification problem with a single numeric predictor. This will help us understand the higher dimensional logistic regression.
When the input \(x\) is one dimensional the expression \[p(x) = \dfrac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}\] defines a family of sigmoid (s-shaped) curves called logistic curves, like the one below, with heights between 0 and 1. You can use the following link to explore them: Dynamic Logistic Curve Exploration
Python Example for the One Dimensional case
To illustrate logistic regression in the one dimensional case we will use the dataset stored in the Default.csv file (already in this session’s folder). This synthetic dataset is one of the example files used in (James et al. 2023, see the References below) and you can find specific information about the dataset in this link: https://islp.readthedocs.io/en/latest/datasets/Default.html
\(\quad\)
The dataset has a binary output target Default (levels No and Yes) and three input variables (predictors): one binary factor Student and two numeric inputs, Income and Balance. Since we want to start with a one dimensional example we will drop two of the inputs, and so we will try to predict Default from the values of balance alone.
\(\quad\)
Besides, even though the dataset contains a large number of rows, we will consider a smaller sample with only \(N = 1200\) rows of the table (in fact, two such samples). The reason is that we want to discuss below the influence of the particular sample data on the modeling process, and this impact is easier to show with smaller samples.
Load the Dataset
Remember that it is always recommended to explore the file with a text editor beforehand.
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

df = pd.read_csv('Default.csv')
df.head(4)
   default student      balance        income
0       No      No   729.526495  44361.625074
1       No     Yes   817.180407  12106.134700
2       No      No  1073.549164  31767.138947
3       No      No   529.250605  35704.493935
Drop Unused Columns and Set Variable Types
We drop two inputs and set the type of the output to categorical, checking the result.
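The drop-and-retype step can be sketched as follows. Since Default.csv may not be at hand here, the sketch builds a tiny stand-in frame with the same column names (an assumption made for illustration only):

```python
import pandas as pd

# Stand-in for the frame loaded from Default.csv (same column names)
df = pd.DataFrame({
    "default": ["No", "Yes", "No"],
    "student": ["No", "No", "Yes"],
    "balance": [729.5, 1073.5, 529.2],
    "income": [44361.6, 31767.1, 35704.5],
})

# Keep only the output and the single numeric input for the 1-D example
df = df.drop(columns=["student", "income"])

# Encode the output as a categorical variable and check the result
df["default"] = df["default"].astype("category")
print(df.dtypes)
```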
Check for Missing Values and Take a Random Sample of the Dataset
This synthetic dataset does not contain missing values.
Code
df.isnull().sum(axis=0)
default 0
balance 0
dtype: int64
We will only consider 1200 rows of the dataset, sampled without replacement. We use the sample method from Pandas (read the docs!) to do this and define the standard names X and Y.
Setting the random_state in your code makes it reproducible and is very important (and required in exams and assignments).
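A minimal sketch of this sampling step, using a synthetic stand-in for the full table (the real one is loaded from Default.csv and has many more rows):

```python
import numpy as np
import pandas as pd

# Stand-in for the full Default table (the real file has 10000 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "default": rng.choice(["No", "Yes"], size=10000, p=[0.97, 0.03]),
    "balance": rng.normal(800, 400, size=10000),
})

# Sample N = 1200 rows without replacement; random_state makes it reproducible
dfS = df.sample(n=1200, replace=False, random_state=1)

# Standard names: X holds the inputs, Y the output
X = dfS[["balance"]]
Y = dfS["default"]
print(X.shape, Y.shape)
```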
Split the Data into Training and Test
We will use 80% of the sample for training. Remember that we use stratified splits.
Code
from sklearn.model_selection import train_test_split

XTR, XTS, YTR, YTS = train_test_split(X, Y,
                                      test_size=0.2,   # fraction reserved as test data
                                      random_state=1,  # seed for replication
                                      stratify=Y)      # preserves distribution of Y
Check for Outliers
No relevant outliers appear in the boxplot for balance.
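The boxplot cell itself did not survive the export; a sketch of the outlier check, with synthetic stand-in data for XTR and a non-interactive backend so it runs anywhere (the IQR count is an extra numeric check, not part of the original):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for the training inputs
rng = np.random.default_rng(1)
XTR = pd.DataFrame({"balance": rng.normal(800, 400, 960)})

# Visual check for outliers
sns.set(rc={'figure.figsize': (10, 4)})
ax = sns.boxplot(data=XTR, x="balance")
plt.close()

# Numeric counterpart: points beyond 1.5 * IQR of the quartiles
q1, q3 = XTR["balance"].quantile([0.25, 0.75])
iqr = q3 - q1
n_out = ((XTR["balance"] < q1 - 1.5 * iqr) |
         (XTR["balance"] > q3 + 1.5 * iqr)).sum()
print(n_out)
```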
Basic balance Statistics and Check for Imbalanced Default Classes
Observe that the difference in proportion between the two levels of the output (< 4% positive cases) turns this into a severely imbalanced classification problem.
Keep in mind that classification problems with imbalanced data such as this one are usually much harder.
Distribution of the Numerical Input balance
The density curve for balance is symmetric enough and bell-shaped, so we will not apply any further transformation and will treat it as approximately normal.
Code
sns.set(rc={'figure.figsize': (10, 4)})
sns.kdeplot(XTR, x="balance")
Back to the Logistic Curve
To improve our understanding of Logistic Regression, let us plot the data. In this plot the horizontal axis shows the input variable and the vertical coordinates 0 and 1 correspond to the two levels of the output \(Y\).
Code
sns.set(rc={'figure.figsize': (10, 3)})
sns.scatterplot(XTR, x="balance", y=YTR, hue=YTR)
What is the modeling signal that we see in this plot? In a nutshell: higher values of balance correspond to an increased probability of default. Remember that this is an imbalanced dataset!
Now imagine that we divide the range of the input into many bins:
and for each bin we obtain the proportion of data points with \(Y = 1\).
In the following picture the height of the green points represents precisely these proportions for each bin.
Code
%run -i 2_2_plot_logistic_curve-01.py
And you can see that they are beginning to look like an s-shaped curve.
In the figure below we have drawn by hand a sigmoid curve, to invite you to think about these closely connected questions:
Is this the best sigmoid curve that we can fit to the data? How do we decide that a curve is better than another one?
And in particular, how do we choose the values \(b_0\) and \(b_1\) in \[f(x) = \dfrac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}\] that correspond to the best curve? Both theoretically and with Python!
That is the subject of the next section.
Fitting the Logistic Regression Model
Likelihood
You probably know that in linear regression the least squares method is used to fit a linear model to the data. We will discuss that problem in due time, when we talk about regression methods. But could we use the least squares method here to fit the sigmoid curve? The answer is, in principle, yes. But the resulting curve and model would not have good probabilistic properties. Therefore we will use an approach that takes probability as its starting point.
The likelihood function corresponding to a choice of \(b_0, b_1\) in the sigmoid curve and a given dataset \((x_0, y_0),\ldots, (x_n, y_n)\) is: \[
{\mathcal{L}}(b_0, b_1) = \prod_{i: y_i = 1}p(x_i)\,\prod_{j: y_j = 0}(1 - p(x_j)),\qquad\text{where}\quad
p(x) = \dfrac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}
\] The idea behind this equation is that the observations in the dataset are independent (i.e. we assume sampling with replacement), so the likelihood represents the probability of observing the data we have, as a function of the coefficient values \(b_0, b_1\).
Then we can use this as our fitting criterion: the fit model will be one with coefficients that maximize the likelihood.
Log-Likelihood
Most optimization methods are implemented to find minimum values of functions. Besides, instead of a big product it is simpler to work with the derivatives of a sum. Therefore we consider the logarithm of the likelihood, called the log-likelihood, defined as follows: \[
\log(\mathcal{L})(b_0, b_1) = \sum_{i = 1}^n \left(y_i \log(p(x_i)) + (1 - y_i) \log(1 - p(x_i))\right)
\] To see why this is the logarithm of the likelihood keep in mind that for each data point \((x_i, y_i)\) we have either \(y_i = 1\) or \(y_i = 0\). Thus only one of the terms in the inner sum is used.
And we reformulate our fitting criterion: the fit model will be one with coefficients that minimize \(-\text{(log-likelihood)}\) (i.e., minus the logarithm of the likelihood).
Logistic Regression Model and scikit-learn Pipelines
The pipeline framework that we have seen for preprocessing is easily extended with a modeling step using LogisticRegression from scikit-learn. Note that preprocessing in this one dimensional case reduces to scaling. The fitting criterion we have described is implemented in the LogisticRegression object. We therefore extend the pipeline with this additional modeling step as follows:
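The pipeline construction can be sketched like this. The step names "scaler" and "LogReg" match the ones used later in the session; the training data is a synthetic stand-in, since the real XTR and YTR come from the Default sample:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training split
rng = np.random.default_rng(1)
XTR = pd.DataFrame({"balance": rng.normal(800, 400, 960)})
YTR = (XTR["balance"] + rng.normal(0, 300, 960) > 1400).astype(int)

# Preprocessing (scaling) followed by the logistic regression model step
LogReg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("LogReg", LogisticRegression()),
])
LogReg_pipe.fit(XTR, YTR)
```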
In your notebook or in the html version of this document you will see an interactive diagram describing the structure of the fitted pipeline.
Coefficients of the Model
The coefficients \(b_0, b_1\) of the fitted model can be recovered as properties of the LogReg step of the pipeline. We use named_steps to inspect the step we are interested in.
Here we also use the functions/objects hstack, newaxis and ravel to shape the coefficients into a convenient structure.
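A sketch of that coefficient extraction, with a stand-in fitted pipeline (the real one was fitted on the Default sample). Note how hstack, newaxis and ravel shape the intercept and slope into one flat array:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in fit (the real LogReg_pipe is fitted on the Default sample)
rng = np.random.default_rng(1)
XTR = pd.DataFrame({"balance": rng.normal(800, 400, 960)})
YTR = (XTR["balance"] + rng.normal(0, 300, 960) > 1400).astype(int)
LogReg_pipe = Pipeline([("scaler", StandardScaler()),
                        ("LogReg", LogisticRegression())]).fit(XTR, YTR)

# Access the fitted LogisticRegression step by name
model = LogReg_pipe.named_steps["LogReg"]

# intercept_ has shape (1,), coef_ has shape (1, n_features);
# newaxis + hstack glue them together, ravel flattens to [b0, b1]
b = np.hstack([model.intercept_[:, np.newaxis], model.coef_]).ravel()
print(b)  # coefficients in the scaled-variable scale
```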
The Fitted Coefficients use the Transformed (Scaled) Variables
In scikit-learn the coefficients of the fitted model correspond to the scaled input variables, and cannot be directly interpreted in terms of the original ones (more details on this below).
Exercise 001.
Look up the numpy functions and objects in the previous code cell using the links provided and figure out what they do in general and in our context.
Plotting the Logistic Model Curve
The remark about the scale of the fitted model in the previous section means that in order to plot the fitted curve we need to use the data transformed by the first step of the pipeline.
Code
LogReg_pipe.named_steps["scaler"]
StandardScaler()
Optional Detour: from Scratch Likelihood and Model Fitting
Python has several libraries that can be used to optimize functions. In particular this means that we can implement the negative log-likelihood function ourselves, and use one of these libraries to fit the logistic regression model. This is the purpose of the script in the next cell.
Code
# %run -i 2_2_Likelihood_and_Fit_From_Scratch.py
Exercise 002
Remove the comment in the previous cell, execute the cell to actually load the code, examine it and finally execute it again to see the results.
Back on the Track: Classical Inference in Logistic Regression
It is important to understand that the Logistic Model that we fitted depends on the particular sample that we are using. To get a better perspective on this essential point, we will now take a second independent sample with another 1200 points from the Default data set and fit a model to this second dataset.
Exercise 003
The code to do this is in the script that appears in the next cell. As before, remove the comment, execute the cell to load the code, then examine it and execute it again.
Code
# %run -i 2_2_Logistic_Model_Second_Sample.py
Theoretical Model vs Fitted to Data Models
The result of the previous exercise illustrates that the models we fit are always data-dependent. We think of the coefficients \(b_0, b_1\) for each of these models as estimates of the coefficients of a theoretical model that describes the relation between \(X\) and \(Y\): \[P(y = 1 | x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\]
Note that here we use \(\beta_0, \beta_1\) to represent the theoretical coefficients, as usual in Statistics. In this context, the coefficients that we have been calling \(b_0\) and \(b_1\) are estimates of the theoretical coefficients and we denote this with the symbols \[\hat\beta_0 = b_0,\, \hat\beta_1 = b_1\]
Inference on the Model Coefficients. Statsmodels vs Scikit-Learn
Since \(\hat\beta_0, \hat\beta_1\) are estimates of \(\beta_0, \beta_1\) we can use Statistical Inference to obtain confidence intervals and p-values for the test of the null hypothesis \(\beta_1 = 0\). Then we use the test to decide whether the input variable \(x\) is in fact useful to predict the output.
In Python we have two libraries that can be used for this purpose: statsmodels and scikit-learn. The former is more oriented towards statistical inference and the latter towards machine learning. If we want confidence intervals and p-values we should use statsmodels. We do this in the cells below, using the Logit function from statsmodels. This function requires that we add a constant column of ones as the first column of the input data. The reason is that, in linear models, and in order to use Linear Algebra machinery, the constant term is treated as a coefficient of a variable that is always 1, like this: \[\beta_0 + \beta_1 x = \left(\begin{array}{cc}1 & x\end{array}\right)\left(\begin{array}{c}\beta_0 \\ \beta_1\end{array}\right)\]
Code
import statsmodels.api as sm

XTR_sm = sm.add_constant(XTR)
XTR_sm.head()
      const      balance
1537    1.0   879.373824
7707    1.0  1845.976439
2165    1.0  1026.104495
2445    1.0  1142.935257
2899    1.0  1155.175006
Now we are ready to fit the model with statsmodels
Optimization terminated successfully.
Current function value: 0.090488
Iterations 9
and to obtain the confidence intervals and p-values for the coefficients using the summary method of the fitted model:
Code
logis_mod_sm_res.summary()
Logit Regression Results

Dep. Variable:    default            No. Observations:  960
Model:            Logit              Df Residuals:      958
Method:           MLE                Df Model:          1
Date:             Tue, 21 Jan 2025   Pseudo R-squ.:     0.3655
Time:             10:11:24           Log-Likelihood:    -86.868
converged:        True               LL-Null:           -136.92
Covariance Type:  nonrobust          LLR p-value:       1.453e-23

             coef    std err        z      P>|z|     [0.025     0.975]
const     -9.0776      0.938   -9.681      0.000    -10.915     -7.240
balance    0.0045      0.001    7.664      0.000      0.003      0.006
Coefficients of the Model and Transformed Variables
If you recall the values \(b_0, b_1\) of the coefficients for the model fitted with scikit-learn you will see that they are different from the values \(\hat\beta_0, \hat\beta_1\) of the coefficients for the model fitted with statsmodels above. This is because the scikit-learn pipeline uses a scaler to transform the input data before fitting the model. The coefficients of the model are then the coefficients of the model fitted to the transformed data.
The data we used to fit the model in statsmodels is not scaled. If we want to compare both approaches we should use the same data. And so below we will fit the model with statsmodels using the scaled data. Recall that the scaled data is stored in the XTR_transf variable.
Optimization terminated successfully.
Current function value: 0.090488
Iterations 9
Logit Regression Results

Dep. Variable:    default            No. Observations:  960
Model:            Logit              Df Residuals:      958
Method:           MLE                Df Model:          1
Date:             Tue, 21 Jan 2025   Pseudo R-squ.:     0.3655
Time:             10:11:24           Log-Likelihood:    -86.868
converged:        True               LL-Null:           -136.92
Covariance Type:  nonrobust          LLR p-value:       1.453e-23

             coef    std err        z      P>|z|     [0.025     0.975]
const     -5.3125      0.472  -11.266      0.000     -6.237     -4.388
balance    2.1259      0.277    7.664      0.000      1.582      2.670
Now the coefficients of both fitted models look similar (the difference is due to the different optimization libraries used by scikit-learn and statsmodels).
P-value interpretation
In this case we see that the p-value for the balance coefficient (\(\beta_1\)) is very small. Thus we reject the null (that says that this coefficient is zero) and conclude that balance in fact explains a part of the variability in Default. If you want to examine the p-values in detail use the code below:
Logistic Regression Models for Classification Problems with Multidimensional Inputs
Up to this point we have considered an example of a binary classification problem with one-dimensional input. But at the beginning of this session we introduced Logistic Regression in a multidimensional setting, with the equation: \[P(y = 1 | \bar x) = \dfrac{e^{b_0 + \bar b_1 \bar x}}{1 + e^{b_0 + \bar b_1 \bar x}}\] in which the input \(\bar x\) is multidimensional (with dimension \(p\)), and the so-called logit \[w = b_0 + \bar b_1 \bar x = b_0 + b_1 x_1 + \cdots + b_p x_p =
\left(\begin{array}{cc}1 & x_1 & \cdots & x_p\end{array}\right)\left(\begin{array}{c}b_0 \\ b_1 \\ \vdots \\ b_p\end{array}\right)\] depends linearly on \(\bar x\). Note that the last matrix expression is the analogue of the one we used to incorporate the constant term in the one dimensional case.
For example, in a 2-dimensional problem where \(\bar x= (x_1, x_2)\), instead of a sigmoid curve we get a logistic surface, illustrated in the figure. Note that the level curves of such a surface are hyperplanes (here, straight lines) in the \((x_1, x_2)\) plane.
Python Example of Multiple Logistic Regression
In the previous session we preprocessed an example with multidimensional inputs stored in the Simdata0.csv file. We will now go back to that file and use it to demonstrate how to fit a Logistic Regression model in this case. You will see that (at least initially) there is really not much difference from the one dimensional case.
Exercise 004
The script in the next cell can be thought of as a draft for a preprocessing template. You have done this already, so remove the comment, execute the cell to load the code, then examine it and execute it again. The result will be a collection of training and test sets XTR, YTR, XTS, YTS. Do some basic EDA of these sets to make sure you connect them with our work from the previous session. What part of the preprocessing has already been completed and what remains to be done?
Code
# %load "2_2_Exercise_001.py"
Code
%run -i "2_2_Exercise_001.py"
Preprocessing completed. Train and test set created.
Visualizing the Preprocessed Dataset
After the initial preprocessing in the script 2_2_Exercise_001.py the inputs of the XTR dataset are X1, X2 and the one hot encoding of X4 into two variables X4_A and X4_B. Let us do a scatterplot to visualize this situation. We use color to represent the two classes of the output YTR and we also use size to represent X4_A.
Code
sns.set(rc={'figure.figsize': (10, 3)})
sns.scatterplot(x=XTR["X1"], y=XTR["X2"], size=XTR["X4_A"], hue=YTR)
Pipeline for the Model
The preceding plot confirms the parabola-shaped boundary between the two output classes in the \(X_1, X_2\) plane. But it also shows a clear relation between the values of \(X_1\) and \(X_4\). We will see how this is reflected in the model below. But first we need to create the pipeline for the model:
The preprocessor step includes a scaler for numerical inputs and leaves the categorical inputs untouched. We already covered this step in the previous session.
After that we simply add the Logistic Regression model step as in the one dimensional case.
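A sketch of that two-step pipeline, with a synthetic stand-in for the preprocessed Simdata0 training set. The "num"/"cat" transformer names mirror the num__/cat__ prefixes that appear in the model summary below:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: two numeric inputs plus a one-hot encoded factor
rng = np.random.default_rng(2)
XTR = pd.DataFrame({
    "X1": rng.normal(size=200),
    "X2": rng.normal(size=200),
    "X4_A": rng.integers(0, 2, 200),
})
XTR["X4_B"] = 1 - XTR["X4_A"]
YTR = (XTR["X2"] > XTR["X1"] ** 2 - 1).astype(int)

# Scale the numeric inputs, pass the one-hot columns through unchanged
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["X1", "X2"]),
    ("cat", "passthrough", ["X4_A", "X4_B"]),
])

# Add the Logistic Regression modeling step, as in the 1-D case
LogReg_pipe2 = Pipeline([
    ("preprocessor", preprocessor),
    ("LogReg", LogisticRegression()),
]).fit(XTR, YTR)
print(LogReg_pipe2.named_steps["LogReg"].coef_.shape)
```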
In this case we have not included a command to put the transformer output in Pandas dataframe format. The (default) result is a NumPy array, which simplifies some downstream processing.
Fitting the Model and Checking the Coefficients
This is as simple as it was in the one dimensional case, just call the fit method. And then we can examine the coefficients of the fitted model.
Optimization terminated successfully.
Current function value: 0.338626
Iterations 9
Logit Regression Results

Dep. Variable:    Y                  No. Observations:  783
Model:            Logit              Df Residuals:      779
Method:           MLE                Df Model:          3
Date:             Tue, 21 Jan 2025   Pseudo R-squ.:     0.4763
Time:             10:11:24           Log-Likelihood:    -265.14
converged:        True               LL-Null:           -506.30
Covariance Type:  nonrobust          LLR p-value:       3.257e-104

               coef     std err          z      P>|z|      [0.025      0.975]
const        0.9118    8.58e+06   1.06e-07      1.000   -1.68e+07    1.68e+07
num__X1      0.1950       0.178      1.097      0.273      -0.153       0.543
num__X2     -2.9133       0.212    -13.723      0.000      -3.329      -2.497
cat__X4_A    0.1740    8.58e+06   2.03e-08      1.000   -1.68e+07    1.68e+07
cat__X4_B    0.7377    8.58e+06    8.6e-08      1.000   -1.68e+07    1.68e+07
Note that the only significant coefficient corresponds to X2. This is not a big surprise: Logistic Regression is a linear model, and left to itself it fails to detect a nonlinear boundary between the classes. Looking at the scatterplot you will see why this is the output.
What Remains to Be Done in this Model?
We can improve this model by addressing the nonlinearity of the class border in the \(X_1, X_2\) plane. To do that we can e.g. add an extra column with the values of \(X_1^2\) and fit a second logistic model with that extra column. We can also drop \(X_4\) from the model: the result of the hypothesis test and the EDA plots indicate that \(X_4\) is not needed because knowing \(X_1\) determines it completely (see also the plot below).
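The squared-term idea can be sketched as follows, on synthetic data that mimics the parabola-shaped boundary (the \(X_4\) columns are omitted here; this is an illustration, not the session's actual second model):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data with the boundary Y = 1 when X2 > X1^2 - 1
rng = np.random.default_rng(3)
XTR = pd.DataFrame({"X1": rng.normal(size=300), "X2": rng.normal(size=300)})
YTR = (XTR["X2"] > XTR["X1"] ** 2 - 1).astype(int)

# Feature engineering: add the squared term as an extra column
XTR_sq = XTR.copy()
XTR_sq["X1_sq"] = XTR_sq["X1"] ** 2

# Compare training accuracy with and without the squared term
acc_lin = LogisticRegression(max_iter=1000).fit(XTR, YTR).score(XTR, YTR)
acc_sq = LogisticRegression(max_iter=1000).fit(XTR_sq, YTR).score(XTR_sq, YTR)
print(acc_lin, acc_sq)
```

With the squared term the boundary becomes linear in the new feature space, so the second model fits it far better.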
These ideas belong to Feature Selection and Engineering. We will return to them in the following session, after we learn about a new type of model. But first we will deal with model performance measures.
Code
sns.set(rc={'figure.figsize': (10, 3)})
sns.boxplot(XTR, x="X1", hue="X4_A")
plt.show()
plt.close()
Prediction and Model Performance Measures in Classification
Model Predictions
We have already fitted some models and so it is time to ask ourselves how good they are. And to measure their performance we need to look at the model predictions, using the methods provided by the pipeline. We store the predictions along with the train data (inputs and output) in a new dataframe; this will become handy when we need to compare models.
Code
dfTR_eval = XTR.copy()
dfTR_eval['Y'] = YTR
Probability (Soft) Predictions
Recall that the predictions of logistic models come in the form of probability estimates. We obtain them with the predict_proba method like this:
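The code cell for this step did not survive the export; a sketch of what it presumably looks like, with a stand-in pipeline (the column name Y_LR_prob is our assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in fitted pipeline and training data
rng = np.random.default_rng(1)
XTR = pd.DataFrame({"balance": rng.normal(800, 400, 960)})
YTR = (XTR["balance"] + rng.normal(0, 300, 960) > 1600).astype(int)
LogReg_pipe = Pipeline([("scaler", StandardScaler()),
                        ("LogReg", LogisticRegression())]).fit(XTR, YTR)

dfTR_eval = XTR.copy()
dfTR_eval["Y"] = YTR

# predict_proba returns one column per class; column 1 is P(Y = 1 | x)
dfTR_eval["Y_LR_prob"] = LogReg_pipe.predict_proba(XTR)[:, 1]
print(dfTR_eval.head())
```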
Check the result exploring dfTR_eval with the head method.
Class (Hard) Predictions
To turn a probability prediction into a class prediction we need to set a threshold. For example, with a threshold of \(0.7\) we assign class \(Y = 1\) to every observation whose predicted probability for \(Y=1\) is greater or equal to \(0.7\). And of course we predict \(Y = 0\) for the remaining observations. We will discuss below how to choose that threshold.
The scikit-learn pipeline includes a predict method that uses a default \(0.5\) threshold to predict the class.
Code
dfTR_eval['Y_LR_pred'] = LogReg_pipe.predict(XTR)
Exercise 007
Again use the head method to check the result. And confirm that the threshold used equals \(0.5\). Hint: see the pandas crosstab function. How would you define Y_LR_pred using a \(0.7\) threshold?
The Confusion Matrix
After (choosing the threshold and) predicting classes, the confusion matrix summarizes the agreement between the model predictions and the true values of \(Y\) in a two-way table. The scikit-learn metrics module provides confusion_matrix to obtain this table and ConfusionMatrixDisplay for a convenient visualization.
Let us obtain the confusion matrix for the train set:
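That cell is missing from this export; a minimal sketch with toy labels standing in for YTR and the class predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and hard predictions
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

For the real train set we would call `confusion_matrix(YTR, dfTR_eval['Y_LR_pred'])`, and ConfusionMatrixDisplay can plot the same table.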
Performance Measures Defined Using the Confusion Matrix
The main diagonal elements in the confusion matrix represent the cases where the model is doing right: true positives and true negatives. The elements in the other cells represent the two types of classification mistakes that the model can make: false positives and false negatives.
Precision and recall are widely used to measure the performance of a classifier when the importance of the two classes is not the same. But neither of them by itself gives a full picture. The F1-measure (or F1-score) tries to combine them into a single number as follows:
\[\text{F1-score} = \dfrac{2\cdot\text{Precision}\,\cdot\,\text{Recall}}{\text{Precision} + \text{Recall}} \quad (\text{Harmonic mean of Precision and Recall})\]
The closer F1 is to 1, the better. You can get its value from scikit-learn as follows.
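The corresponding cell is missing from this export; a toy sketch of the scikit-learn call, which also checks the harmonic-mean formula:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy true labels and hard predictions
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of the two
print(prec, rec, f1)
```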
Other Class Prediction Performance Measures: Kappa
The Kappa (Cohen’s) score tries to take into account the class distributions of the training set samples. It is defined as \[\kappa = \dfrac{\text{Accuracy} - \text{RandomAgreement}}{1 - \text{RandomAgreement}}\] where \[ \text{RandomAgreement} = \dfrac{(TP + FP)(TP + FN)}{Total^2} + \dfrac{(TN + FP)(TN + FN)}{Total^2}\] and \(Total = TP + FP + TN + FN\). The \(\text{RandomAgreement}\) term measures the probability that the predictions and the true labels agree by chance, choosing classes at random according to the observed proportions.
Kappa is standardized to take values between -1 and 1. A value of 1 means perfect classification, 0 corresponds to random class selection, and negative values indicate performance worse than choosing at random. Kappa values from 0.6 to 1 are sometimes interpreted as substantial to excellent agreement.
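A toy sketch of the scikit-learn call, on the same labels as above, so you can check Kappa against the formula by hand:

```python
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Toy true labels and hard predictions
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# Here accuracy = 6/8 = 0.75 and
# RandomAgreement = (5*5 + 3*3) / 8^2 = 0.53125
kappa = cohen_kappa_score(y_true, y_pred)
print(kappa)
```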
Performance Measures Based on Probability Prediction.
We have emphasized in the previous paragraphs that the performance measures we have been discussing assume in all cases that we are using classes as predictions (what we called the hard approach to classification). But the class prediction is often the result of selecting a threshold and applying it to a classifier that predicts scores or probabilities. And therefore its performance depends on the choice of the threshold. In this section we are going to explore this dependency and introduce some performance measures directly based on the probabilities or scores.
Exercise 009
In a previous exercise we asked you to define Y_LR_pred using a \(0.7\) threshold. Do it now using a general threshold stored in the variable thrs (initially set to 0.7). Store the resulting class predictions in a column of dfTR_eval called Y_LR_pred_thrs. Use that column to compute the true positive rate (tpr) and false positive rate (fpr) corresponding to that threshold.
Important: after doing that, and without using Python what do you think will happen to tpr and fpr as you move the threshold closer to 1? And if you move it closer to 0? Think it through before checking your ideas with Python!
Code
# %load "../exclude/code/2_2_Exercise_009.py"
ROC Curve
The ROC curve (receiver operating characteristic curve, terminology from Signal Theory) is defined using all possible thresholds. The points in the curve have coordinates (fpr, tpr), going from (0, 0) for a threshold equal to 1, to (1, 1) when the threshold is 0. In scikit-learn we can get these values of fpr, tpr and the associated thresholds with roc_curve.
For high thresholds (we store all three values in a pandas dataframe):
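The corresponding cell is missing from this export; a sketch with toy labels and scores standing in for the real training data:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Toy true labels and predicted scores
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.3, 0.7])

# One (fpr, tpr) pair per threshold, stored together in a dataframe
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_df = pd.DataFrame({"threshold": thresholds, "fpr": fpr, "tpr": tpr})
print(roc_df.head())
```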
To actually plot the ROC curve we will use the following code. We will shortly explain some elements of that plot.
Code
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(LogReg_pipe, XTR, YTR, plot_chance_level=True)
plt.show()
plt.close()
ROC Curve and Classifier Performance. AUC.
How can we use the ROC curve to judge the performance of a classifier? A good classifier is expected to have low fpr (close to 0) and high tpr (close to 1). In fact a perfect classifier (one that always gets it right) has precisely fpr = 0, tpr = 1. Therefore a good classifier has a ROC curve that comes very close to the upper left corner of the plot (the point (0, 1)).
In particular this means (because the ROC curve is monotonic) that the area under the curve of a good classifier is close to 1, the total area of the square. That leads to the definition of a widely used measure of classifier performance: the ROC AUC (area under the curve).
To obtain the ROC AUC in Python use the code below. Recall that we obtained fpr and tpr previously with roc_curve.
Code
from sklearn.metrics import roc_curve, auc

auc(fpr, tpr)
0.924750412985707
Random Classifier.
The random classifier assigns scores to the observations using a uniform distribution in \([0, 1]\) without regard to the real classes. Therefore, for any given threshold \(0\leq p\leq 1\) the expected fraction of positive predictions is \(p\), both for real negatives and for real positives. That means \[\dfrac{FP}{FP + TN} = \dfrac{TP}{TP + FN}\] That is, fpr = tpr. And this means that for a random classifier we expect the point (fpr, tpr) to lie on the diagonal that we represented as a dashed line in the ROC curve plot.
We emphasize that these are expected tpr and fpr values, because in a specific implementation of a random classifier the values will usually deviate from their theoretical expectation.
The area under a random classifier's ROC curve is 0.5. A classifier with a value close to this one is almost random, and a classifier with smaller AUC is called worse than random.
Exercise 010
Implement the random classifier and check its fpr, tpr values.
Code
# %load "../exclude/code/2_2_Exercise_010.py"
Scores and Calibration
In the previous session we explained that many classifiers, even when using the soft approach, return ordered scores and not probabilities. That is why, when we need to interpret these scores as probabilities, we need to think about calibration. Recall that a perfectly calibrated scoring means that if we take 100 predictions of the model, all of them with score 0.7, then seventy of them should be real positives and thirty should be real negatives.
In the case of logistic regression the model is based on estimated probabilities, so we usually expect logistic classifiers to be fairly well calibrated. Note however that there are factors (e.g. class imbalance) that affect calibration, and so perfect calibration is not attained in practice.
Calibration Curves
In order to assess whether a classifier is well calibrated we can bin the scores it outputs and compute the relative frequency of the positive class in each bin. If the classifier is well calibrated, when we plot the bin centers against these relative frequencies the points should lie on or close to the diagonal line of the unit square. This is what a calibration curve is for: to check how close the output of our classifier is to this ideal.
Calibration Curves with scikit-learn
The code is similar to what we did for ROC curves. The predictions of the logistic classifier are reasonably close to the dashed line.
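That cell did not survive the export; a minimal sketch of the underlying computation using sklearn's calibration_curve, on synthetic perfectly calibrated scores (CalibrationDisplay can plot the same curve):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic, perfectly calibrated scores: Y = 1 with probability equal to the score
rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < probs).astype(int)

# Fraction of real positives per bin vs mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y, probs, n_bins=10)
print(np.round(frac_pos, 2))
print(np.round(mean_pred, 2))
```

For a well-calibrated classifier the two arrays are close, so the curve hugs the diagonal.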
These are histograms of the scores assigned by the classifier for real negatives (left panel) and real positives (right panel). For a perfect classifier you would see a single red bar at the leftmost position for \(Y=0\) and a single blue bar at the rightmost position for \(Y=1\). But for non-perfect classifiers the scores leak into the rest of the interval. The wider their distribution, the worse the performance in the class indicated by the value of \(Y\).
Keep in mind that the vertical scales are not the same! The code used to obtain these plots is a bit more involved, so use the %load magic to inspect it.
Code
%run -i "2_2_probability_histograms.py"
Performance Measures and Plots for the Test Set
Everything that we have done regarding performance has used the training set. In order to see how our model behaves with data that it has not seen before we should now explore the performance of the model using the test set.
The following exercise is very important!
Exercise 011
Create a dataframe called dfTS_eval from XTS and YTS
Obtain the test set predictions of the model (both class and probability) and store them in dfTS_eval.
Get the confusion matrix, accuracy, sensitivity, specificity, precision, f1 and kappa for the test set.
Find the ROC AUC and plot the ROC curve, calibration curve and probability histograms.
Compare your results with those from training. What are your conclusions?
Let us Talk about Books
Throughout the course we will introduce book and web references we consider worth knowing. We begin with a list of recommended books (details in the References section below, in the pdf and html versions of this document):
Logistic Regression models are well studied and widely used models and we have barely scratched the surface of the topic. They are however limited to problems with a linear boundary between classes (unless we add well chosen nonlinear terms). They are fast to train and predict, they do not overfit easily but they are not flexible enough for more complex problems and they do not extend easily to multiclass problems.
In our next session we will meet the KNN models. This will lead us to consider the notion of hyperparameter of a model. And that in turn will open a deeper discussion about model validation and generalization.
Hastie, Trevor J, Robert John Tibshirani, and Jerome H Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://hastie.su.domains/ElemStatLearn.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning with Applications in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0.